Introduction
In this report, we will analyze the TelcoChurn dataset, which
contains data on 7043 customers of a telecommunication company. The
dataset has various variables that describe the customers’ service and
billing information along with their usage patterns as categorical data.
One of the main variables of interest is Churn, which indicates whether
the customer left the company or not at the time of data collection. The
goal of this report is to explore the factors that influence customer
churn and to provide recommendations for improving customer
retention.
Loading Required Libraries
library(ggplot2)
library(plotly)
library(caret)
library(factoextra)
library(dplyr)
library(fpc)
library(dbscan)
library(Hmisc)
library(corrplot)
library(missForest)
library(cluster)
library(knitr)
library(FNN)
library(solitude)
Loading Data
telco_raw <- read.csv("TelcoChurn.csv")
str(telco_raw)
## 'data.frame': 7043 obs. of 22 variables:
## $ customerID : chr "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
## $ gender : chr "Female" "Male" "Male" "Male" ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : chr "Yes" "No" "No" "No" ...
## $ Dependents : chr "No" "No" "No" "No" ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : chr "No" "Yes" "Yes" "No" ...
## $ MultipleLines : chr "No phone service" "No" "No" "No phone service" ...
## $ InternetService : chr "DSL" "DSL" "DSL" "DSL" ...
## $ OnlineSecurity : chr "No" "Yes" "Yes" "Yes" ...
## $ OnlineBackup : chr "Yes" "No" "Yes" "No" ...
## $ DeviceProtection: chr "No" "Yes" "No" "Yes" ...
## $ TechSupport : chr "No" "No" "No" "Yes" ...
## $ StreamingTV : chr "No" "No" "No" "No" ...
## $ StreamingMovies : chr "No" "No" "No" "No" ...
## $ Contract : chr "Month-to-month" "One year" "Month-to-month" "One year" ...
## $ PaperlessBilling: chr "Yes" "No" "Yes" "No" ...
## $ PaymentMethod : chr "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : chr "No" "No" "Yes" "No" ...
## $ ChurnYes : int 0 0 1 0 1 1 0 0 1 0 ...
telco_raw <- telco_raw[,-1] #removing first column #the column of customerID is not important so i will drop it
Let’s see if our data has missing values.
colSums(is.na(telco_raw))
## gender SeniorCitizen Partner Dependents
## 0 0 0 0
## tenure PhoneService MultipleLines InternetService
## 0 0 0 0
## OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 0 0 0 0
## StreamingTV StreamingMovies Contract PaperlessBilling
## 0 0 0 0
## PaymentMethod MonthlyCharges TotalCharges Churn
## 0 0 11 0
## ChurnYes
## 0
Imputation for Missing Values
Upon examination, we notice that 11 customers in our dataset have
missing values for “Total charges” column. This corresponds to
approximately 0.01% of the total values (21 * 7043 = 147903) in the
dataset. Given this proportion, the impact on the overall variance is
deemed minimal, whether these values are imputed or left as is.
Consequently, opting for a sophisticated imputation method is considered
unnecessary. Instead, a pragmatic approach is taken by choosing to
impute these missing values with complete case imputation. This decision
serves to simplify the process, reducing computational complexity and
resource requirements.
telco_imp <- na.omit(telco_raw)
str(telco_imp)
## 'data.frame': 7032 obs. of 21 variables:
## $ gender : chr "Female" "Male" "Male" "Male" ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : chr "Yes" "No" "No" "No" ...
## $ Dependents : chr "No" "No" "No" "No" ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : chr "No" "Yes" "Yes" "No" ...
## $ MultipleLines : chr "No phone service" "No" "No" "No phone service" ...
## $ InternetService : chr "DSL" "DSL" "DSL" "DSL" ...
## $ OnlineSecurity : chr "No" "Yes" "Yes" "Yes" ...
## $ OnlineBackup : chr "Yes" "No" "Yes" "No" ...
## $ DeviceProtection: chr "No" "Yes" "No" "Yes" ...
## $ TechSupport : chr "No" "No" "No" "Yes" ...
## $ StreamingTV : chr "No" "No" "No" "No" ...
## $ StreamingMovies : chr "No" "No" "No" "No" ...
## $ Contract : chr "Month-to-month" "One year" "Month-to-month" "One year" ...
## $ PaperlessBilling: chr "Yes" "No" "Yes" "No" ...
## $ PaymentMethod : chr "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : chr "No" "No" "Yes" "No" ...
## $ ChurnYes : int 0 0 1 0 1 1 0 0 1 0 ...
## - attr(*, "na.action")= 'omit' Named int [1:11] 489 754 937 1083 1341 3332 3827 4381 5219 6671 ...
## ..- attr(*, "names")= chr [1:11] "489" "754" "937" "1083" ...
Preliminary Data Visualization
We kick off our analysis with a visual exploration of key variables
in the TelcoChurn dataset. Let’s begin by examining the data through
informative visualizations to uncover initial insights and patterns that
may illuminate the factors influencing customer churn. This will be
essential in formulating initial hypotheses for our data and
predictions.
Target Column - Churn
Let’s see the value counts of our target or reference.
props <- as.data.frame(prop.table(table(telco_imp$Churn))*100)
colnames(props) <- c("Churn Status", "Percentage")
props
## Churn Status Percentage
## 1 No 73.4215
## 2 Yes 26.5785
# Create a bar plot
ggplot(props, aes(x = `Churn Status`, y = Percentage, fill = `Churn Status`)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Churn Rate",
x = "Churn Status",
y = "Percentage") +
scale_fill_manual(values = c("lightgreen", "steelblue"))

Here, we can see that the churn rate is about 26.54% in our actual
dataset.
Total Charges, Monthly Charges and Tenure - Churn Rate Plot
plot_tc <- ggplot(telco_imp, aes(x = TotalCharges, fill = Churn)) +
geom_density(alpha = 0.5) +
labs(title = "Density Plot of Total Charges by Churn Status",
x = "Total Charges",
y = "Density") +
scale_fill_manual(values = c("No" = "green", "Yes" = "steelblue"))
plot_mc <- ggplot(telco_imp, aes(x = MonthlyCharges, fill = Churn)) +
geom_density(alpha = 0.5) +
labs(title = "Density Plot of Monthly Charges by Churn Status",
x = "Monthly Charges",
y = "Density") +
scale_fill_manual(values = c("No" = "green", "Yes" = "steelblue"))
plot_tenure <- ggplot(telco_imp, aes(x = tenure, fill = Churn)) +
geom_density(alpha = 0.5) +
labs(title = "Density Plot of Tenure by Churn Status",
x = "Tenure",
y = "Density") +
scale_fill_manual(values = c("No" = "green", "Yes" = "steelblue"))
annotations <- list(
list(
x = 0.15,
y = -0.1,
text = "Total Charges ($)",
xref = "paper",
yref = "paper",
xanchor = "center",
yanchor = "bottom",
showarrow = FALSE
),
list(
x = 0.5,
y = -0.1,
text = "Monthly Charges ($)",
xref = "paper",
yref = "paper",
xanchor = "center",
yanchor = "bottom",
showarrow = FALSE
),
list(
x = 0.84,
y = -0.1,
text = "Tenure (in months)",
xref = "paper",
yref = "paper",
xanchor = "center",
yanchor = "bottom",
showarrow = FALSE
)
)
plotly::subplot(plot_tc, plot_mc, plot_tenure) %>%
layout(annotations = annotations, title = "Density of Churn Across Total Charges, Monthly Charges, and Tenure", titlefont = list(size = 16))
Churn Rate by Senior Citizens
seniors <- as.factor(telco_imp$SeniorCitizen)
ggplot(telco_imp, aes(x = seniors, fill = Churn)) +
geom_bar(position = "dodge") +
labs(title = "Churn Rate by SeniorCitizen",
x = "Senior Citizen",
y = "Count") +
scale_fill_manual(values = c("No" = "lightgreen", "Yes" = "steelblue"))
## Churn Rate by Contract Type
ggplot(telco_imp, aes(x = Contract, fill = Churn)) +
geom_bar(position = "dodge") +
labs(title = "Churn Rate by Contract Type",
x = "Contract Type",
y = "Count") +
scale_fill_manual(values = c("No" = "lightgreen", "Yes" = "steelblue"))

Churn Rate by Gender
ggplot(telco_imp, aes(x = gender, fill = Churn)) +
geom_bar(position = "dodge") +
labs(title = "Churn Rate by Gender",
x = "Gender",
y = "Count") +
scale_fill_manual(values = c("No" = "lightgreen", "Yes" = "steelblue"))

Initial Hypothesis
Analyzing the above data visualizations, we can make the following
initial hypothses:
- Gender Influence: Gender appears to have a
negligible impact on churn rates.
- Seniority Impact: The churn rate for senior
citizens is approximately twice as high compared to younger
demographics.
- Contract Duration: Customers opting for
month-to-month contracts exhibit higher churn rates in contrast to those
with yearly contracts.
- Monthly and Total Charges Dynamics: A notable trend
indicates that the churn rate tends to increase when monthly charges are
high. However, customers with higher total charges seem to be less
inclined to leave the company.
- Tenure: Customers with longer tenure are less
inclined to leave the company as well.
These hypotheses provide a preliminary foundation for further
investigation and analysis.
Data Cleaning, Variable Selection, and Scaling The Data
Now that we have established our initial hypothesis using the raw
data, we can start our analysis. Next, we’ll streamline our dataset for
unsupervised learning models by excluding the dependent variables. We
will, however, later use the ChurnYes variable, which is a binary
variable indicating whether the customer left the company or not, in the
confusion matrix analysis as reference.
To enhance our dataset’s suitability for cluster analysis, we’ll also
transform categorical/factor variables into dummy variables. This not
only ensures compatibility but also facilitates the scaling process for
subsequent analysis.
telco_clean <- subset(telco_imp, select = -c(Churn, ChurnYes))
telco_clean <- as.data.frame(model.matrix(~ . - 1, data = telco_clean))
str(telco_clean)
## 'data.frame': 7032 obs. of 31 variables:
## $ genderFemale : num 1 0 0 0 1 1 0 1 1 0 ...
## $ genderMale : num 0 1 1 1 0 0 1 0 0 1 ...
## $ SeniorCitizen : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PartnerYes : num 1 0 0 0 0 0 0 0 1 0 ...
## $ DependentsYes : num 0 0 0 0 0 0 1 0 0 1 ...
## $ tenure : num 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneServiceYes : num 0 1 1 0 1 1 1 0 1 1 ...
## $ MultipleLinesNo phone service : num 1 0 0 1 0 0 0 1 0 0 ...
## $ MultipleLinesYes : num 0 0 0 0 0 1 1 0 1 0 ...
## $ InternetServiceFiber optic : num 0 0 0 0 1 1 1 0 1 0 ...
## $ InternetServiceNo : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OnlineSecurityNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OnlineSecurityYes : num 0 1 1 1 0 0 0 1 0 1 ...
## $ OnlineBackupNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OnlineBackupYes : num 1 0 1 0 0 0 1 0 0 1 ...
## $ DeviceProtectionNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ DeviceProtectionYes : num 0 1 0 1 0 1 0 0 1 0 ...
## $ TechSupportNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TechSupportYes : num 0 0 0 1 0 0 0 0 1 0 ...
## $ StreamingTVNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StreamingTVYes : num 0 0 0 0 0 1 1 0 1 0 ...
## $ StreamingMoviesNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StreamingMoviesYes : num 0 0 0 0 0 1 0 0 1 0 ...
## $ ContractOne year : num 0 1 0 1 0 0 0 0 0 1 ...
## $ ContractTwo year : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PaperlessBillingYes : num 1 0 1 0 1 1 1 0 1 0 ...
## $ PaymentMethodCredit card (automatic): num 0 0 0 0 0 0 1 0 0 0 ...
## $ PaymentMethodElectronic check : num 1 0 0 0 1 1 0 0 1 0 ...
## $ PaymentMethodMailed check : num 0 1 1 0 0 0 0 1 0 0 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
colSums(is.na(telco_clean)) #checking
## genderFemale genderMale
## 0 0
## SeniorCitizen PartnerYes
## 0 0
## DependentsYes tenure
## 0 0
## PhoneServiceYes MultipleLinesNo phone service
## 0 0
## MultipleLinesYes InternetServiceFiber optic
## 0 0
## InternetServiceNo OnlineSecurityNo internet service
## 0 0
## OnlineSecurityYes OnlineBackupNo internet service
## 0 0
## OnlineBackupYes DeviceProtectionNo internet service
## 0 0
## DeviceProtectionYes TechSupportNo internet service
## 0 0
## TechSupportYes StreamingTVNo internet service
## 0 0
## StreamingTVYes StreamingMoviesNo internet service
## 0 0
## StreamingMoviesYes ContractOne year
## 0 0
## ContractTwo year PaperlessBillingYes
## 0 0
## PaymentMethodCredit card (automatic) PaymentMethodElectronic check
## 0 0
## PaymentMethodMailed check MonthlyCharges
## 0 0
## TotalCharges
## 0
Great, we have taken care of missing values and have also converted
all factor/categorical variables into dummy variables.
However, we do see that there are spaces and
characters(parentheses) that R might find unusual. This may cause errors
in our analysis and coding later on. Let’s take care of it
now.
colnames(telco_clean) <- gsub("\\(automatic\\)", "", colnames(telco_clean)) #the word automatic adds no meaning or value to payment method variable
colnames(telco_clean) <- gsub(" ","_",colnames(telco_clean)) #sustituting space with underscore
str(telco_clean)
## 'data.frame': 7032 obs. of 31 variables:
## $ genderFemale : num 1 0 0 0 1 1 0 1 1 0 ...
## $ genderMale : num 0 1 1 1 0 0 1 0 0 1 ...
## $ SeniorCitizen : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PartnerYes : num 1 0 0 0 0 0 0 0 1 0 ...
## $ DependentsYes : num 0 0 0 0 0 0 1 0 0 1 ...
## $ tenure : num 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneServiceYes : num 0 1 1 0 1 1 1 0 1 1 ...
## $ MultipleLinesNo_phone_service : num 1 0 0 1 0 0 0 1 0 0 ...
## $ MultipleLinesYes : num 0 0 0 0 0 1 1 0 1 0 ...
## $ InternetServiceFiber_optic : num 0 0 0 0 1 1 1 0 1 0 ...
## $ InternetServiceNo : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OnlineSecurityNo_internet_service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OnlineSecurityYes : num 0 1 1 1 0 0 0 1 0 1 ...
## $ OnlineBackupNo_internet_service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OnlineBackupYes : num 1 0 1 0 0 0 1 0 0 1 ...
## $ DeviceProtectionNo_internet_service: num 0 0 0 0 0 0 0 0 0 0 ...
## $ DeviceProtectionYes : num 0 1 0 1 0 1 0 0 1 0 ...
## $ TechSupportNo_internet_service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TechSupportYes : num 0 0 0 1 0 0 0 0 1 0 ...
## $ StreamingTVNo_internet_service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StreamingTVYes : num 0 0 0 0 0 1 1 0 1 0 ...
## $ StreamingMoviesNo_internet_service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StreamingMoviesYes : num 0 0 0 0 0 1 0 0 1 0 ...
## $ ContractOne_year : num 0 1 0 1 0 0 0 0 0 1 ...
## $ ContractTwo_year : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PaperlessBillingYes : num 1 0 1 0 1 1 1 0 1 0 ...
## $ PaymentMethodCredit_card_ : num 0 0 0 0 0 0 1 0 0 0 ...
## $ PaymentMethodElectronic_check : num 1 0 0 0 1 1 0 0 1 0 ...
## $ PaymentMethodMailed_check : num 0 1 1 0 0 0 0 1 0 0 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
Min-Max Scaling
Given the diverse range of categorical (dummy) variables,
min-max scaling is employed to transform these features into a common
scale, ensuring each variable contributes equally to our analyses. This
normalization technique brings uniformity to the data, preventing the
undue influence of variables with different scales and fostering more
robust and reliable results in subsequent analyses
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
telco_scaled <- as.data.frame(lapply(telco_clean, normalize))
summary(telco_scaled)
## genderFemale genderMale SeniorCitizen PartnerYes
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :1.0000 Median :0.0000 Median :0.0000
## Mean :0.4953 Mean :0.5047 Mean :0.1624 Mean :0.4825
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## DependentsYes tenure PhoneServiceYes
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.1127 1st Qu.:1.0000
## Median :0.0000 Median :0.3944 Median :1.0000
## Mean :0.2985 Mean :0.4426 Mean :0.9033
## 3rd Qu.:1.0000 3rd Qu.:0.7606 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## MultipleLinesNo_phone_service MultipleLinesYes InternetServiceFiber_optic
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.0967 Mean :0.4219 Mean :0.4403
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## InternetServiceNo OnlineSecurityNo_internet_service OnlineSecurityYes
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.2162 Mean :0.2162 Mean :0.2865
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## OnlineBackupNo_internet_service OnlineBackupYes
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000
## Mean :0.2162 Mean :0.3449
## 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000
## DeviceProtectionNo_internet_service DeviceProtectionYes
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000
## Mean :0.2162 Mean :0.3439
## 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000
## TechSupportNo_internet_service TechSupportYes StreamingTVNo_internet_service
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.2162 Mean :0.2901 Mean :0.2162
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## StreamingTVYes StreamingMoviesNo_internet_service StreamingMoviesYes
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.3844 Mean :0.2162 Mean :0.3884
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## ContractOne_year ContractTwo_year PaperlessBillingYes
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :1.0000
## Mean :0.2093 Mean :0.2396 Mean :0.5927
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## PaymentMethodCredit_card_ PaymentMethodElectronic_check
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000
## Mean :0.2163 Mean :0.3363
## 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000
## PaymentMethodMailed_check MonthlyCharges TotalCharges
## Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.1725 1st Qu.:0.04416
## Median :0.0000 Median :0.5184 Median :0.15909
## Mean :0.2281 Mean :0.4632 Mean :0.26131
## 3rd Qu.:0.0000 3rd Qu.:0.7126 3rd Qu.:0.43572
## Max. :1.0000 Max. :1.0000 Max. :1.00000
Should we do PCA to reduce dimensionality? Why or why not?
Let’s do a Primary Component Analysis (PCA) and see the results to
determine if it will be helpful.
## [1] 7032 31
This shows that our dataset has 7032 rows and 31 variables as of
now.
Primary Component Analysis (PCA)
We will perform a PCA and select the number of principal components
that explain about 90% of the total variance in the data. This will
allow us to reduce the dimensionality and complexity of the data while
retaining most of the information.
telco.pca <- prcomp(telco_scaled)
summary(telco.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.3109 0.7871 0.7074 0.65010 0.52230 0.48191 0.46287
## Proportion of Variance 0.3005 0.1083 0.0875 0.07391 0.04771 0.04061 0.03747
## Cumulative Proportion 0.3005 0.4088 0.4963 0.57022 0.61793 0.65854 0.69601
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.45467 0.43657 0.42126 0.39339 0.38384 0.36974 0.36393
## Proportion of Variance 0.03615 0.03333 0.03103 0.02706 0.02576 0.02391 0.02316
## Cumulative Proportion 0.73216 0.76549 0.79652 0.82358 0.84935 0.87326 0.89642
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.35445 0.33180 0.31313 0.29986 0.2789 0.22867 0.18342
## Proportion of Variance 0.02197 0.01925 0.01715 0.01572 0.0136 0.00914 0.00588
## Cumulative Proportion 0.91839 0.93764 0.95479 0.97051 0.9841 0.99326 0.99914
## PC22 PC23 PC24 PC25 PC26 PC27
## Standard deviation 0.06951 0.009603 3.297e-14 5.943e-16 4.963e-16 1.245e-16
## Proportion of Variance 0.00085 0.000020 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion 0.99998 1.000000 1.000e+00 1.000e+00 1.000e+00 1.000e+00
## PC28 PC29 PC30 PC31
## Standard deviation 1.245e-16 1.245e-16 1.245e-16 6.659e-17
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00 1.000e+00
In the above summary, we see that we can obtain about 90% of the
variance using 15 Primary components. We have effectively reduced our
dataset from 31 columns to 15. Let’s create a new dataset with these 15
PCs for our analysis.
PCA Visualization
library(ggbiplot)
ggbiplot(telco.pca)

Here we see that ContractTwo_year, DependentsYes, PhoneServiceYes,
PaymentMethods, along with some other variables contribute the highest
to PC1. Given that PC1 alone explains about 30.1% of the variance in our
dataset, we can also say that these variables are substantial in
characterizing our customers.
Cumulative Variance Plot
cumulativepro <- cumsum(telco.pca$sdev^2 / sum(telco.pca$sdev^2))
plot(cumulativepro, type = "l", xlab = "PC #", ylab = "Amount of explained variance", main = "Cumulative variance plot")
abline(h = 0.9, col="blue", lty=5)

The cumulative variance plot further illustrates our choice of 15
PCs.
PCA Dataframe
telco.pca_df <- as.data.frame(telco.pca$x[,1:15])
summary(telco.pca_df)
## PC1 PC2 PC3 PC4
## Min. :-1.7554 Min. :-1.59630 Min. :-0.7990 Min. :-2.11968
## 1st Qu.:-0.9001 1st Qu.:-0.59522 1st Qu.:-0.7189 1st Qu.:-0.40535
## Median :-0.4684 Median :-0.04814 Median : 0.6378 Median : 0.09843
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.1208 3rd Qu.: 0.57155 3rd Qu.: 0.6940 3rd Qu.: 0.47937
## Max. : 2.6244 Max. : 2.11816 Max. : 0.8220 Max. : 1.29461
## PC5 PC6 PC7 PC8
## Min. :-1.31894 Min. :-1.7439 Min. :-1.29676 Min. :-1.57258
## 1st Qu.:-0.40386 1st Qu.:-0.2769 1st Qu.:-0.22153 1st Qu.:-0.27902
## Median :-0.03369 Median :-0.0146 Median : 0.09828 Median : 0.02988
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.39564 3rd Qu.: 0.3075 3rd Qu.: 0.31956 3rd Qu.: 0.30623
## Max. : 1.50058 Max. : 1.3817 Max. : 1.12561 Max. : 1.32613
## PC9 PC10 PC11 PC12
## Min. :-1.1800 Min. :-1.34761 Min. :-1.294965 Min. :-1.42558
## 1st Qu.:-0.2879 1st Qu.:-0.28289 1st Qu.:-0.249400 1st Qu.:-0.25347
## Median : 0.0266 Median : 0.01177 Median : 0.009472 Median :-0.01144
## Mean : 0.0000 Mean : 0.00000 Mean : 0.000000 Mean : 0.00000
## 3rd Qu.: 0.2497 3rd Qu.: 0.27639 3rd Qu.: 0.236322 3rd Qu.: 0.24679
## Max. : 1.2822 Max. : 1.41890 Max. : 1.580278 Max. : 1.37524
## PC13 PC14 PC15
## Min. :-1.08525 Min. :-1.15657 Min. :-1.20127
## 1st Qu.:-0.25608 1st Qu.:-0.25322 1st Qu.:-0.19082
## Median : 0.04359 Median : 0.03605 Median : 0.01886
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.19056 3rd Qu.: 0.20133 3rd Qu.: 0.18024
## Max. : 1.16017 Max. : 1.43827 Max. : 1.26639
Final Thoughts for PCA
Principal Component Analysis (PCA) is a valuable technique for
dimensionality reduction, particularly in scenarios where datasets
exhibit high dimensionality, such as this one. In the case of the Telco
Churn dataset, which involves numerous features related to customer
behavior and characteristics, PCA could be considered. PCA may offer
benefits such as mitigating multicollinearity, handling high-dimensional
data, and potentially revealing underlying patterns. However, it is
crucial to weigh these advantages against the interpretability of the
original features. The Telco Churn dataset involves variables that are
directly tied to customer attributes and behaviors, making their
interpretability paramount. Additionally, the dataset size and the
desire to maintain the specific meaning of each feature may impact the
suitability of PCA. As PCA assumes linear relationships between
variables, it is essential to assess whether such assumptions align with
the underlying patterns in the data. A thoughtful evaluation of the
trade-offs between dimensionality reduction and interpretability should
guide the decision on whether to employ PCA in the Telco Churn dataset
analysis.
Hence, considering the Telco Churn dataset’s context, where the
telecom company seeks detailed insights into the specific customer
profiles associated with churn, opting against PCA may be a
prudent choice. PCA inherently transforms original features
into linear combinations, potentially diluting the individual
interpretability of each variable. In this scenario, maintaining the
fidelity of the original features becomes paramount, as the company aims
to understand the nuanced characteristics contributing to customer
churn. The absence of PCA allows for a more direct examination of the
relationships between individual features and churn behavior, providing
a clearer and more granular understanding of the factors influencing
customer decisions. This decision aligns with the company’s strategic
goal of tailoring retention strategies to the unique attributes of
customers prone to churning, enhancing the efficacy of targeted
interventions.
K-Means Cluster Model
Let’s start with finding the optimal number of clusters for this
dataset. As instructed, we will use the raw data that is cleaned and
scaled, and not the PCA dataset for our analysis.
Optimal Cluster Size
fviz_nbclust(telco_scaled, kmeans, method = "silhouette")

fviz_nbclust(telco_scaled, kmeans, method = "wss")

fviz_nbclust(telco.pca_df, kmeans, method = "gap_stat", nboot=10)

Here, we see that Silhoutte method recommends an optimal cluster size
of 2. The WSS method also has an elbow at 2 clusters. However, the gap
stat method recommends a cluster size of 3. Let’s visualize our cluster
to see if 3 clusters make sense for our data.
K-Means Cluster Visualization - 2 Clusters
set.seed(12345)
km <- kmeans(telco_scaled, 3)
fviz_cluster(km, data = telco_scaled) ## cluster visualization

Upon examining our clustered data, it becomes apparent that clusters
1 and 3 exhibit some degree of overlap, hinting at potential
similarities or shared characteristics. Despite this, the overall
distinctiveness of the clusters remains discernible.
Let’s visualize our cluster to see if 2 clusters make
sense for our data.
K-Means Cluster Visualization
km2 <- kmeans(telco_scaled, 2)
fviz_cluster(km2, data = telco_scaled) ## cluster visualization

The visual inspection of our clusters indeed reveals distinct
patterns, suggesting meaningful segmentations within our dataset.
However, given the substantial size of our dataset, opting for 3
clusters seems appropriate. This choice allows for a nuanced
characterization of our customer base, leveraging the additional
granularity provided by a higher number of clusters. The decision to
proceed with three clusters is driven by the aim to glean richer
insights into customer behavior, fostering a more detailed and
comprehensive understanding that aligns with the diverse profiles
present in our data.
K- Means Cluster Interpretation
For our interpretation and understanding of customer groups, let’s
look at the raw numerical averages, before scaling, of different
variables. This will help us build customer profiles based on different
clusters and identify certain clusters that may be
“churn-sensitive”.
##
## 1 2 3
## 803 5512 717
telco_imp$cluster <- km$cluster
result <- telco_imp %>%
group_by(cluster) %>%
dplyr::summarize(across(where(is.numeric), mean, na.rm = TRUE),
across(where(is.character), ~names(which.max(table(.))))
)
print(result)
## # A tibble: 3 × 22
## cluster SeniorCitizen tenure MonthlyCharges TotalCharges ChurnYes gender
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1 0.0399 20.3 20.6 425. 0.111 Male
## 2 2 0.198 32.9 76.9 2730. 0.319 Male
## 3 3 0.0279 42.3 21.6 934. 0.0335 Female
## # ℹ 15 more variables: Partner <chr>, Dependents <chr>, PhoneService <chr>,
## # MultipleLines <chr>, InternetService <chr>, OnlineSecurity <chr>,
## # OnlineBackup <chr>, DeviceProtection <chr>, TechSupport <chr>,
## # StreamingTV <chr>, StreamingMovies <chr>, Contract <chr>,
## # PaperlessBilling <chr>, PaymentMethod <chr>, Churn <chr>
Here, examining the average values within each cluster unveils
distinctive cluster characteristics:
| 1 |
803 Customers |
Cluster 1 primarily comprises a relatively
lower concentration of senior citizens and relatively
newer customers with the shortest tenure. These individuals
have the lowest average monthly charges compared to
other clusters and also demonstrate the lowest average total
charges, indicating minimal spending with the company.
Typically, these customers are not enrolled in paperless
billing, prefer mailed checks for payment, and
often have a single phone line with no internet service
through Telcom. They are likely to have a month-to-month
contract and be single males without dependents or a
partner. |
| 2 |
5512 Customers |
Cluster 2 is characterized by the highest
churn rate, a relatively higher concentration of senior
citizens and slightly older customers with better
tenure than those in Cluster 1. On average, these customers
have the highest monthly charges and total charges
compared to other clusters. They tend to prefer paperless
billing and often pay through Electronic
Check. Additionally, they typically have more than one
phone line and subscribe to internet services through
Telcom. These customers are likely to have a
month-to-month contract and be single males
without dependents or a partner. |
| 3 |
717 Customers |
Cluster 3 exhibits the lowest concentration
of senior citizens and the oldest customers with the
highest tenure. On average, these customers have lower
monthly charges compared to other clusters and comparatively
lower total charges. Similar to Cluster 1, they are not
typically enrolled in paperless billing and often pay
through mailed checks. They usually have two
year contracts and a single phone line and no internet
service through Telcom. These customers are likely to be
females with dependents or a partner. |
Churn Likely Customers
In our analysis, we have identified Cluster 2 as having the highest
churn rate among the three clusters. This cluster stands out as it
comprises customers who exhibit certain characteristics that make them
more prone to churn. Understanding the profile of churn-sensitive
customers is crucial for Telcom to implement targeted strategies aimed
at customer retention.
Cluster 2 Characteristics:
- Demographics: This cluster shows a relatively
higher concentration of senior citizens and customers with better tenure
than those in other clusters.
- Billing Preferences: Customers in this cluster tend
to prefer paperless billing and often choose to pay through Electronic
Check.
- Service Subscriptions: Cluster 2 customers
typically have more than one phone line and subscribe to internet
services through Telcom.
- Financial Behavior: On average, customers in this
cluster have the highest monthly charges and total charges compared to
other clusters.
This somewhat aligns with our initial hypothesis that senior
citizen customers with high monthly charges are more likely to churn
than other customers. We do see that our initial hypothesis that
customers with high total charges are less inclined to churn is not
correct, as per our k-means model. We also hypothesized that customers
with month to month contract have higher churn tendency which is
confirmed by our model.
Building the Churn-Sensitive Customer Profile:
Based on the characteristics of Cluster 2, Telcom can identify key
attributes associated with customers more likely to churn: 1.
Age Factor: Senior citizens and older customers within
this cluster might be more sensitive to service changes or pricing,
requiring tailored communication and incentives. 2. Billing and
Payment Preferences: The preference for paperless billing and
electronic payment methods indicates a technological inclination. Telcom
can explore personalized digital communication channels for this
segment. 3. Service Subscription Impact: Customers with
multiple phone lines and internet services may have higher expectations.
Offering targeted promotions or service enhancements can address their
needs.
By recognizing these patterns and understanding the distinct
behaviors of Cluster 2, Telcom can implement targeted retention
strategies. This may include personalized communications, loyalty
programs, or special offers to mitigate churn risk among customers with
similar characteristics.
Confusion Matrix
Let’s test the accuracy and sensitivity of our model using a
confusion matrix.
pred <- ifelse(telco_imp$cluster == 2, "Churn", "No Churn")
actual <- ifelse(telco_imp$ChurnYes == 1, "Churn", "No Churn")
cfmatrix <- confusionMatrix(data = factor(pred), reference = factor(actual))
cfmatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction Churn No Churn
## Churn 1756 3756
## No Churn 113 1407
##
## Accuracy : 0.4498
## 95% CI : (0.4381, 0.4615)
## No Information Rate : 0.7342
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1308
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9395
## Specificity : 0.2725
## Pos Pred Value : 0.3186
## Neg Pred Value : 0.9257
## Prevalence : 0.2658
## Detection Rate : 0.2497
## Detection Prevalence : 0.7838
## Balanced Accuracy : 0.6060
##
## 'Positive' Class : Churn
##
Our model’s performance metrics are indicative of its effectiveness
in capturing relevant patterns within the data. Specifically, the model
exhibits a commendable sensitivity of 0.94, reflecting its ability to
correctly identify instances of positive churn. However, it’s crucial to
note that the accuracy stands at 0.45, underscoring the need for further
exploration and potential refinement. The interplay between sensitivity
and accuracy warrants a nuanced examination to optimize our model for a
more balanced and reliable predictive outcome.
Conclusion
In this analysis, we applied the k-means clustering algorithm to the
TelcoChurn dataset, which contains data on 7043 customers of a
telecommunication company. The dataset has various variables that
describe the customers’ service, billing, and payment information along
with some usage data. The goal of this analysis is to explore the
factors that influence customer churn and to provide recommendations for
improving customer retention.
Before applying the k-means algorithm, we performed some data
preprocessing steps, such as removing 11 customers with missing values
for Total charges, converting the categorical variables into dummy
variables, and scaling the numeric variables using min-max scaling. We
then used the gap stat, silhoutte, and wss (elbow) method to determine
the optimal number of clusters, which was 3. We assigned each customer
to one of the three clusters based on their similarity to the cluster
centroids.
We also analyzed the characteristics of each cluster and found some
key insights that can help us understand the customer behavior and
preferences. Based on these findings, we can make some recommendations
for improving customer retention and satisfaction. For the first
cluster, we suggest that the company should offer more incentives and
discounts for loyal customers, such as free upgrades, rewards, or
referrals. The company should also provide more options and flexibility
for the internet service, such as different plans, speeds, or features.
The company should also communicate more frequently and effectively with
these customers, such as sending personalized emails, surveys, or
feedback requests. For the second cluster, we suggest that the company
should promote and educate the customers about the benefits and features
of the internet service, such as convenience, entertainment, or
security. The company should also offer some trial or introductory
offers for the internet service, such as free or discounted periods, or
bundles with other services. The company should also maintain and
improve the quality and reliability of the phone service, such as
reducing dropped calls, improving coverage, or providing customer
support.
We evaluated the performance of our model using two metrics: accuracy
and sensitivity. Accuracy measures how well the model can correctly
classify the customers into their respective clusters, while sensitivity
measures how well the model can identify the customers who churned. Our
model achieved an accuracy of 44.98% and a sensitivity of 93.95%. This
means that our model can correctly cluster about 45% of all the
customers, and can detect about 94% of the customers who left the
company.
In conclusion, the k-means clustering algorithm provided a useful way
to segment the customers based on their characteristics and behavior.
However, the model had a low accuracy rate, which suggests that there is
room for improvement. A larger and more diverse dataset could help to
train the model better and capture more variation in the customer
behavior. We could also experiment with different clustering algorithms
and models with a larger dataset.